In this homework, the objectives are to
Use R to preprocess and examine a biomedical dataset
Implement multiple unsupervised learning methods in a real-world biomedical problem, including: Principal Component Analysis, Hierarchical Clustering, and K-means Clustering in R
Visualize and understand principal components, hierarchical clustering dendrograms, and the outputs of K-means clustering in R
Assignments will only be accepted in electronic format in RMarkdown (.rmd) files and knitted .html files. You must submit both the RMD file and HTML file for each HW. 5 points will be deducted for every assignment submission that does not include either the RMarkdown file or the knitted html file. Your code should be adequately commented to clearly explain the steps you used to produce the analyses. RMarkdown homework files should be uploaded to Sakai with the naming convention date_lastname_firstname_HW[X].Rmd. For example, my first homework assignment would be named 20200911_Dunn_Jessilyn_HW1.Rmd. It is important to note that 5 points will be deducted for every assignment that is named improperly. Please add your answer to each question directly after the question prompt in the homework .Rmd file template provided below.
library(tidyverse)
library(ggplot2)
library(lubridate)
library(patchwork)
library(gridExtra)
library(psych)
library(corrplot)
library(ggfortify)
library(factoextra)
Breast Cancer Data https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
data<-data.frame(read_csv(file = "HW3_data.csv"))
## Parsed with column specification:
## cols(
## .default = col_double(),
## diagnosis = col_character()
## )
## See spec(...) for full column specifications.
head(data, 10)
## id diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1 842302 M 17.99 10.38 122.80 1001.0
## 2 842517 M 20.57 17.77 132.90 1326.0
## 3 84300903 M 19.69 21.25 130.00 1203.0
## 4 84348301 M 11.42 20.38 77.58 386.1
## 5 84358402 M 20.29 14.34 135.10 1297.0
## 6 843786 M 12.45 15.70 82.57 477.1
## 7 844359 M 18.25 19.98 119.60 1040.0
## 8 84458202 M 13.71 20.83 90.20 577.9
## 9 844981 M 13.00 21.82 87.50 519.8
## 10 84501001 M 12.46 24.04 83.97 475.9
## smoothness_mean compactness_mean concavity_mean concave.points_mean
## 1 0.11840 0.27760 0.30010 0.14710
## 2 0.08474 0.07864 0.08690 0.07017
## 3 0.10960 0.15990 0.19740 0.12790
## 4 0.14250 0.28390 0.24140 0.10520
## 5 0.10030 0.13280 0.19800 0.10430
## 6 0.12780 0.17000 0.15780 0.08089
## 7 0.09463 0.10900 0.11270 0.07400
## 8 0.11890 0.16450 0.09366 0.05985
## 9 0.12730 0.19320 0.18590 0.09353
## 10 0.11860 0.23960 0.22730 0.08543
## symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se
## 1 0.2419 0.07871 1.0950 0.9053 8.589
## 2 0.1812 0.05667 0.5435 0.7339 3.398
## 3 0.2069 0.05999 0.7456 0.7869 4.585
## 4 0.2597 0.09744 0.4956 1.1560 3.445
## 5 0.1809 0.05883 0.7572 0.7813 5.438
## 6 0.2087 0.07613 0.3345 0.8902 2.217
## 7 0.1794 0.05742 0.4467 0.7732 3.180
## 8 0.2196 0.07451 0.5835 1.3770 3.856
## 9 0.2350 0.07389 0.3063 1.0020 2.406
## 10 0.2030 0.08243 0.2976 1.5990 2.039
## area_se smoothness_se compactness_se concavity_se concave.points_se
## 1 153.40 0.006399 0.04904 0.05373 0.01587
## 2 74.08 0.005225 0.01308 0.01860 0.01340
## 3 94.03 0.006150 0.04006 0.03832 0.02058
## 4 27.23 0.009110 0.07458 0.05661 0.01867
## 5 94.44 0.011490 0.02461 0.05688 0.01885
## 6 27.19 0.007510 0.03345 0.03672 0.01137
## 7 53.91 0.004314 0.01382 0.02254 0.01039
## 8 50.96 0.008805 0.03029 0.02488 0.01448
## 9 24.32 0.005731 0.03502 0.03553 0.01226
## 10 23.94 0.007149 0.07217 0.07743 0.01432
## symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst
## 1 0.03003 0.006193 25.38 17.33 184.60
## 2 0.01389 0.003532 24.99 23.41 158.80
## 3 0.02250 0.004571 23.57 25.53 152.50
## 4 0.05963 0.009208 14.91 26.50 98.87
## 5 0.01756 0.005115 22.54 16.67 152.20
## 6 0.02165 0.005082 15.47 23.75 103.40
## 7 0.01369 0.002179 22.88 27.66 153.20
## 8 0.01486 0.005412 17.06 28.14 110.60
## 9 0.02143 0.003749 15.49 30.73 106.20
## 10 0.01789 0.010080 15.09 40.68 97.65
## area_worst smoothness_worst compactness_worst concavity_worst
## 1 2019.0 0.1622 0.6656 0.7119
## 2 1956.0 0.1238 0.1866 0.2416
## 3 1709.0 0.1444 0.4245 0.4504
## 4 567.7 0.2098 0.8663 0.6869
## 5 1575.0 0.1374 0.2050 0.4000
## 6 741.6 0.1791 0.5249 0.5355
## 7 1606.0 0.1442 0.2576 0.3784
## 8 897.0 0.1654 0.3682 0.2678
## 9 739.3 0.1703 0.5401 0.5390
## 10 711.4 0.1853 1.0580 1.1050
## concave.points_worst symmetry_worst fractal_dimension_worst age
## 1 0.2654 0.4601 0.11890 NA
## 2 0.1860 0.2750 0.08902 NA
## 3 0.2430 0.3613 0.08758 NA
## 4 0.2575 0.6638 0.17300 NA
## 5 0.1625 0.2364 0.07678 NA
## 6 0.1741 0.3985 0.12440 NA
## 7 0.1932 0.3063 0.08368 28
## 8 0.1556 0.3196 0.11510 NA
## 9 0.2060 0.4378 0.10720 NA
## 10 0.2210 0.4366 0.20750 NA
nrow(data)
## [1] 569
Answer a. There are 569 observations in the dataset. b. How many independent variables are there?
total_independent_variables = length(names(data))-2
#Ignoring the id and diagnosis columns
print(total_independent_variables)
## [1] 31
Answer b. There are 31 independent variables (excluding the diagnosis and the patient id). c. Which variables have missing values and how many values were missing in each?
var_name<-names(data)#Store the column names of the data frame
logic_missing=list()
for(i in var_name){
if(NA %in% data[[i]])
logic_missing<-append(logic_missing, i)
}
print(logic_missing)
## [[1]]
## [1] "texture_mean"
##
## [[2]]
## [1] "age"
for(a in logic_missing)
{ print(a)
print(sum(is.na(data[[a]])))
}
## [1] "texture_mean"
## [1] 5
## [1] "age"
## [1] 542
Answer c. Missing values occur in the texture_mean and age variables: there are 5 missing values in texture_mean and 542 in age.
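As an aside, the per-column NA counts can also be obtained without an explicit loop. This is a sketch, with `count_missing` a helper name I am introducing:

```r
# Sketch: vectorized NA counting; returns counts only for columns
# that actually contain missing values.
count_missing <- function(df) {
  n <- colSums(is.na(df))
  n[n > 0]
}
```

Calling `count_missing(data)` on the homework data frame should report the same two columns as the loop above.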
summary(as.factor(data$diagnosis))
## B M
## 357 212
Answer d. There are 357 benign and 212 malignant tumours in total, i.e., 357 patients with a benign tumour and 212 with a malignant tumour. For this question, please type your answers clearly outside of R chunks, and do not just show the result of running your code.
for(a in logic_missing)
{ print(a)
print(sum(is.na(data[[a]])))
}
## [1] "texture_mean"
## [1] 5
## [1] "age"
## [1] 542
Answer 3. The age variable has the largest number of NA values (542), i.e., 95.3% of its data is missing. We should not impute this column: complete-case analysis would remove almost all of the rows, and there are too few observed values in the age column to estimate a distribution for statistical imputation (such as mean imputation).
data_modified<-mutate(data, age=NULL)#Removing the age column with large number of NA observations.
data_modified[["id"]]<-c(1:length(data$id))
data_modified<-data_modified %>% rename(Index = id)
str(data_modified)
## 'data.frame': 569 obs. of 32 variables:
## $ Index : int 1 2 3 4 5 6 7 8 9 10 ...
## $ diagnosis : chr "M" "M" "M" "M" ...
## $ radius_mean : num 18 20.6 19.7 11.4 20.3 ...
## $ texture_mean : num 10.4 17.8 21.2 20.4 14.3 ...
## $ perimeter_mean : num 122.8 132.9 130 77.6 135.1 ...
## $ area_mean : num 1001 1326 1203 386 1297 ...
## $ smoothness_mean : num 0.1184 0.0847 0.1096 0.1425 0.1003 ...
## $ compactness_mean : num 0.2776 0.0786 0.1599 0.2839 0.1328 ...
## $ concavity_mean : num 0.3001 0.0869 0.1974 0.2414 0.198 ...
## $ concave.points_mean : num 0.1471 0.0702 0.1279 0.1052 0.1043 ...
## $ symmetry_mean : num 0.242 0.181 0.207 0.26 0.181 ...
## $ fractal_dimension_mean : num 0.0787 0.0567 0.06 0.0974 0.0588 ...
## $ radius_se : num 1.095 0.543 0.746 0.496 0.757 ...
## $ texture_se : num 0.905 0.734 0.787 1.156 0.781 ...
## $ perimeter_se : num 8.59 3.4 4.58 3.44 5.44 ...
## $ area_se : num 153.4 74.1 94 27.2 94.4 ...
## $ smoothness_se : num 0.0064 0.00522 0.00615 0.00911 0.01149 ...
## $ compactness_se : num 0.049 0.0131 0.0401 0.0746 0.0246 ...
## $ concavity_se : num 0.0537 0.0186 0.0383 0.0566 0.0569 ...
## $ concave.points_se : num 0.0159 0.0134 0.0206 0.0187 0.0188 ...
## $ symmetry_se : num 0.03 0.0139 0.0225 0.0596 0.0176 ...
## $ fractal_dimension_se : num 0.00619 0.00353 0.00457 0.00921 0.00511 ...
## $ radius_worst : num 25.4 25 23.6 14.9 22.5 ...
## $ texture_worst : num 17.3 23.4 25.5 26.5 16.7 ...
## $ perimeter_worst : num 184.6 158.8 152.5 98.9 152.2 ...
## $ area_worst : num 2019 1956 1709 568 1575 ...
## $ smoothness_worst : num 0.162 0.124 0.144 0.21 0.137 ...
## $ compactness_worst : num 0.666 0.187 0.424 0.866 0.205 ...
## $ concavity_worst : num 0.712 0.242 0.45 0.687 0.4 ...
## $ concave.points_worst : num 0.265 0.186 0.243 0.258 0.163 ...
## $ symmetry_worst : num 0.46 0.275 0.361 0.664 0.236 ...
## $ fractal_dimension_worst: num 0.1189 0.089 0.0876 0.173 0.0768 ...
Answer 4. First, let us convert the diagnosis column into a factor so that we can use it for grouping. I will then apply group-wise (benign/malignant) mean imputation (a type of statistical imputation) to the missing values of the texture_mean column. The reason we should impute separately across the diagnosis groups is that the variable is expected to have a different distribution for each diagnosis category (benign or malignant).
data_modified$diagnosis<-as.factor(data_modified$diagnosis)
#Applying mean imputation by groups
categories<-unique(data_modified$diagnosis)
impute_fun<-function(df, x){
a1<-df %>%filter(diagnosis == x)
a1$texture_mean[which(is.na(a1$texture_mean))]<-mean(a1$texture_mean, na.rm= T)
return(a1)
}
a4<-lapply(categories, function(x)impute_fun(data_modified, x))
row_count<-sapply(a4, nrow)
data_final<-do.call(rbind,a4)
data_final<-arrange(data_final, Index)
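An equivalent group-wise mean imputation can be done in base R with ave(), without splitting and re-binding the data frame. This is a sketch; `impute_by_group` is a hypothetical helper name:

```r
# Sketch: within each group defined by g, replace NAs in x with the
# group-specific mean; the original row order is preserved.
impute_by_group <- function(x, g) {
  ave(x, g, FUN = function(v) replace(v, is.na(v), mean(v, na.rm = TRUE)))
}
```

For example, `impute_by_group(data_modified$texture_mean, data_modified$diagnosis)` should fill the 5 missing texture_mean values with the mean of the matching diagnosis group, the same result as the lapply/filter approach above.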
ggplot(gather(data_final[, 3:32]), aes(x=value))+geom_histogram(bins = 20)+labs(x = "Values", y ="Frequency", title="Histograms")+theme(plot.title = element_text(hjust = 0.5))+facet_wrap(~key, scale = 'free_x', ncol=3)
df_final<-mutate_at(data_final, vars(contains("area")|contains("concavity")), function(x)log(x+1))
Gather the transformed data set to plot the histograms:
ggplot(gather(df_final[, 3:32]), aes(value))+geom_histogram(bins = 20)+labs(x = "Values", y="Frequency", title="Histograms")+theme(plot.title = element_text(hjust = 0.5))+facet_wrap(~key, scale = 'free_x', ncol=3)
Answer 7. The pre-processed dataset needs to be scaled before performing PCA. PCA projects the data onto mutually orthogonal directions chosen to maximise variance, so without scaling, variables with large magnitudes or different units would dominate the components. Standardizing gives every variable mean zero and unit variance, so that each variable contributes to the projection according to its correlation structure rather than its units or scale.
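As a quick sanity check on what standardization does, base R's scale() performs the same z-score transformation (subtract the column mean, divide by the column standard deviation) that is applied column-by-column in the chunk that follows:

```r
# Sketch: z-score standardization with base R's scale().
z <- scale(matrix(c(1, 2, 3, 10, 20, 30), ncol = 2))
colMeans(z)      # ~0 for both columns
apply(z, 2, sd)  # 1 for both columns
```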
if(!("robustHD" %in% installed.packages()))
install.packages("robustHD")
library("robustHD")
## Loading required package: perry
## Loading required package: parallel
## Loading required package: robustbase
for(i in c(3:32))
df_final[,i]<-standardize(df_final[, i], centerFun = mean, scaleFun = sd)
summary(df_final)
## Index diagnosis radius_mean texture_mean perimeter_mean
## Min. : 1 B:357 Min. :-2.0279 Min. :-2.2388 Min. :-1.9828
## 1st Qu.:143 M:212 1st Qu.:-0.6888 1st Qu.:-0.7224 1st Qu.:-0.6913
## Median :285 Median :-0.2149 Median :-0.1019 Median :-0.2358
## Mean :285 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:427 3rd Qu.: 0.4690 3rd Qu.: 0.5770 3rd Qu.: 0.4992
## Max. :569 Max. : 3.9678 Max. : 4.6596 Max. : 3.9726
## area_mean smoothness_mean compactness_mean concavity_mean
## Min. :-2.8860 Min. :-3.10935 Min. :-1.6087 Min. :-1.1774
## 1st Qu.:-0.6672 1st Qu.:-0.71034 1st Qu.:-0.7464 1st Qu.:-0.7619
## Median :-0.1065 Median :-0.03486 Median :-0.2217 Median :-0.3256
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6198 3rd Qu.: 0.63564 3rd Qu.: 0.4934 3rd Qu.: 0.5746
## Max. : 3.0268 Max. : 4.76672 Max. : 4.5644 Max. : 3.8920
## concave.points_mean symmetry_mean fractal_dimension_mean
## Min. :-1.2607 Min. :-2.74171 Min. :-1.8183
## 1st Qu.:-0.7373 1st Qu.:-0.70262 1st Qu.:-0.7220
## Median :-0.3974 Median :-0.07156 Median :-0.1781
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.6464 3rd Qu.: 0.53031 3rd Qu.: 0.4706
## Max. : 3.9245 Max. : 4.48081 Max. : 4.9066
## radius_se texture_se perimeter_se area_se
## Min. :-1.0590 Min. :-1.5529 Min. :-1.0431 Min. :-1.9362
## 1st Qu.:-0.6230 1st Qu.:-0.6942 1st Qu.:-0.6232 1st Qu.:-0.6865
## Median :-0.2920 Median :-0.1973 Median :-0.2864 Median :-0.2568
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.2659 3rd Qu.: 0.4661 3rd Qu.: 0.2428 3rd Qu.: 0.5832
## Max. : 8.8991 Max. : 6.6494 Max. : 9.4537 Max. : 4.0749
## smoothness_se compactness_se concavity_se concave.points_se
## Min. :-1.7745 Min. :-1.2970 Min. :-1.1284 Min. :-1.9118
## 1st Qu.:-0.6235 1st Qu.:-0.6923 1st Qu.:-0.5833 1st Qu.:-0.6739
## Median :-0.2201 Median :-0.2808 Median :-0.1981 Median :-0.1404
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3680 3rd Qu.: 0.3893 3rd Qu.: 0.3708 3rd Qu.: 0.4722
## Max. : 8.0229 Max. : 6.1381 Max. :11.0139 Max. : 6.6438
## symmetry_se fractal_dimension_se radius_worst texture_worst
## Min. :-1.5315 Min. :-1.0960 Min. :-1.7254 Min. :-2.22204
## 1st Qu.:-0.6511 1st Qu.:-0.5846 1st Qu.:-0.6743 1st Qu.:-0.74797
## Median :-0.2192 Median :-0.2297 Median :-0.2688 Median :-0.04348
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.3554 3rd Qu.: 0.2884 3rd Qu.: 0.5216 3rd Qu.: 0.65776
## Max. : 7.0657 Max. : 9.8429 Max. : 4.0906 Max. : 3.88249
## perimeter_worst area_worst smoothness_worst compactness_worst
## Min. :-1.6919 Min. :-2.5092 Min. :-2.6803 Min. :-1.4426
## 1st Qu.:-0.6890 1st Qu.:-0.6689 1st Qu.:-0.6906 1st Qu.:-0.6805
## Median :-0.2857 Median :-0.1521 Median :-0.0468 Median :-0.2693
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5398 3rd Qu.: 0.6712 3rd Qu.: 0.5970 3rd Qu.: 0.5392
## Max. : 4.2836 Max. : 3.1371 Max. : 3.9519 Max. : 5.1084
## concavity_worst concave.points_worst symmetry_worst
## Min. :-1.4782 Min. :-1.7435 Min. :-2.1591
## 1st Qu.:-0.7766 1st Qu.:-0.7557 1st Qu.:-0.6413
## Median :-0.1557 Median :-0.2233 Median :-0.1273
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6201 3rd Qu.: 0.7119 3rd Qu.: 0.4497
## Max. : 3.7763 Max. : 2.6835 Max. : 6.0407
## fractal_dimension_worst
## Min. :-1.6004
## 1st Qu.:-0.6913
## Median :-0.2163
## Mean : 0.0000
## 3rd Qu.: 0.4504
## Max. : 6.8408
The mean of every variable above is zero, confirming that the standardization has been applied properly.
result_pca<-princomp(df_final[, c(3:32)], cor = TRUE)#You can also use the prcomp function with scale. = TRUE instead of princomp with cor = TRUE
summary_pca<-summary(result_pca)
summary(result_pca)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 3.6635805 2.3977788 1.66888098 1.39743453 1.2826898
## Proportion of Variance 0.4473941 0.1916448 0.09283879 0.06509411 0.0548431
## Cumulative Proportion 0.4473941 0.6390388 0.73187763 0.79697174 0.8518148
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 1.09719086 0.78224220 0.68277447 0.62792155 0.58639504
## Proportion of Variance 0.04012759 0.02039676 0.01553937 0.01314285 0.01146197
## Cumulative Proportion 0.89194243 0.91233919 0.92787856 0.94102141 0.95248338
## Comp.11 Comp.12 Comp.13 Comp.14
## Standard deviation 0.528068541 0.503843846 0.48160262 0.401349921
## Proportion of Variance 0.009295213 0.008461954 0.00773137 0.005369392
## Cumulative Proportion 0.961778593 0.970240547 0.97797192 0.983341309
## Comp.15 Comp.16 Comp.17 Comp.18
## Standard deviation 0.29833386 0.287310199 0.270933816 0.235029902
## Proportion of Variance 0.00296677 0.002751572 0.002446838 0.001841302
## Cumulative Proportion 0.98630808 0.989059650 0.991506488 0.993347790
## Comp.19 Comp.20 Comp.21 Comp.22
## Standard deviation 0.19382798 0.181997710 0.175691108 0.1628127344
## Proportion of Variance 0.00125231 0.001104106 0.001028912 0.0008835995
## Cumulative Proportion 0.99460010 0.995704205 0.996733117 0.9976167167
## Comp.23 Comp.24 Comp.25 Comp.26
## Standard deviation 0.1474461944 0.1255421684 0.1145291762 0.1070840208
## Proportion of Variance 0.0007246793 0.0005253612 0.0004372311 0.0003822329
## Cumulative Proportion 0.9983413960 0.9988667572 0.9993039883 0.9996862212
## Comp.27 Comp.28 Comp.29 Comp.30
## Standard deviation 0.0836331425 4.086534e-02 2.476666e-02 1.164032e-02
## Proportion of Variance 0.0002331501 5.566586e-05 2.044624e-05 4.516570e-06
## Cumulative Proportion 0.9999193713 9.999750e-01 9.999955e-01 1.000000e+00
screeplot(result_pca)
10. Plot the following two plots and use patchwork/gridExtra to position the two plots side by side: a. the proportion of variance explained by each principal component, and b. the cumulative proportion of variance explained over the number of principal components
car_var<-paste("pc", c(1:30), sep = '')
df_pca<-data.frame(PC_comp = car_var, sdev = result_pca$sdev)
df_pca<-mutate(df_pca,var = (sdev)^2)
a = sum(df_pca$var)#Total variance across all components
df_pca<-mutate(df_pca,var_prop = var/a)
df_pca$cum_var<-cumsum(df_pca$var_prop)
p1<-df_pca[c(1:20), ]%>%
ggplot(aes(x = reorder(PC_comp, -var_prop), y = var_prop))+geom_col()+labs(x = "Principal Components", y = "Proportional Variance", title ="Proportion of Variance(First 20 PC)")+theme(plot.title = element_text(hjust = 0.5))
p2<-df_pca[c(1:20), ]%>%
ggplot(aes(x = 1:20, y = cum_var))+geom_point()+geom_line()+scale_x_continuous(breaks=c(1:20), labels=paste("pc", c(1:20), sep=""))+geom_hline(yintercept=0.88, linetype = "dashed", color ="red")+geom_hline(yintercept=0.95, linetype = "dashed", color="green")+labs(x = "Principal Components", y = "Cumulative Variance", title ="Cumulative Variance(First 20 PC)")+theme(plot.title = element_text(hjust = 0.5))
grid.arrange(p1, p2, ncol=2, top="Variance")
Note: please remember to clearly label your plots with titles, axis labels and legends when appropriate.
df_pca$var_prop[c(1:3)]*100
## [1] 44.739406 19.164478 9.283879
Answer 11. The first, second, and third principal components capture 44.7%, 19.2%, and 9.3% of the variance, respectively. 6 principal components are needed to capture 88% of the variance, and 10 principal components are needed to capture 95%.
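The component counts above can also be computed directly from the PCA standard deviations rather than read off the summary table. This is a sketch; `pcs_needed` is a hypothetical helper:

```r
# Sketch: smallest number of components whose cumulative proportion of
# variance reaches the given threshold.
pcs_needed <- function(sdev, threshold) {
  cum_var <- cumsum(sdev^2) / sum(sdev^2)
  which(cum_var >= threshold)[1]
}
```

On this dataset, `pcs_needed(result_pca$sdev, 0.88)` should return 6 and `pcs_needed(result_pca$sdev, 0.95)` should return 10, matching the answer above.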
pc_final_1<-data.frame(load = result_pca$loadings[,1])%>%arrange(desc(load))
head(pc_final_1)
## load
## concave.points_mean 0.2597346
## concavity_mean 0.2574969
## concave.points_worst 0.2521591
## compactness_mean 0.2380904
## perimeter_worst 0.2367356
## concavity_worst 0.2323810
Answer 12. The variables concave.points_mean, concavity_mean, and concave.points_worst contribute most to the variance captured by PC1. 14. Because of the relatively large number of variables in this dataset, it's very difficult to see the biplot clearly. Use the autoplot() function in the "ggfortify" package to display a clearer biplot overlaid with a scatter plot for the first 2 principal components.
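The same inspection can be repeated for any component by ranking variables by the absolute value of their loadings. This is a sketch with a hypothetical helper name, `top_loadings`, applicable to any princomp result:

```r
# Sketch: the n variables with the largest absolute loadings on a given PC.
top_loadings <- function(pca, pc = 1, n = 5) {
  head(sort(abs(pca$loadings[, pc]), decreasing = TRUE), n)
}
```

For example, `top_loadings(result_pca, pc = 2)` would list the variables dominating PC2.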
biplot(result_pca, choices = 1:2)
autoplot(result_pca, data = df_final, x=1, y = 2, geom='point', colour = 'diagnosis')+labs(title = "Principal component analysis using PCA between PC1 and PC2")+ theme(plot.title = element_text(hjust = 0.5))
## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
15. Plot a grid of 3 x 2 scatter plots, where each plot is a scatter plot between two of the first 4 principal components, with different colors for each diagnosis group. For example, in grid cell (1,1), you should plot a scatter plot where the x-axis is PC1 and y-axis is PC2 where observations are colored according to the diagnosis group. Remember to adjust the plot display size so that you can see it clearly. Add legends and labels when appropriate. What are your observations?
pca12<-autoplot(result_pca, data = df_final, x=1, y = 2, geom='point', colour = 'diagnosis')+labs(title = "Principal component analysis using PCA between PC1 and PC2")+ theme(plot.title = element_text(hjust = 0.5))
pca13<-autoplot(result_pca, data = df_final, x=1, y = 3, geom='point', colour = 'diagnosis')+labs(title = "Principal component analysis using PCA between PC1 and PC3")+ theme(plot.title = element_text(hjust = 0.5))
pca14<-autoplot(result_pca, data = df_final, x=1, y = 4, geom='point', colour = 'diagnosis')+labs(title = "Principal component analysis using PCA between PC1 and PC4")+ theme(plot.title = element_text(hjust = 0.5))
pca23<-autoplot(result_pca, data = df_final, x=2, y = 3, geom='point', colour = 'diagnosis')+labs(title = "Principal component analysis using PCA between PC2 and PC3")+ theme(plot.title = element_text(hjust = 0.5))
pca24<-autoplot(result_pca, data = df_final, x=2, y = 4, geom='point', colour = 'diagnosis')+labs(title = "Principal component analysis using PCA between PC2 and PC4")+ theme(plot.title = element_text(hjust = 0.5))
pca34<-autoplot(result_pca, data = df_final, x=3, y = 4, geom='point', colour = 'diagnosis')+labs(title = "Principal component analysis using PCA between PC3 and PC4")+ theme(plot.title = element_text(hjust = 0.5))
library(gridExtra)
grid.arrange(pca12,pca13,pca14,pca23, pca24, pca34, ncol=2, nrow=3, top = "PCA Analysis of various components")
Answer 15. We can clearly see that in the plots lacking component 1, the two classes overlap and do not separate well. This is likely because the first component captures about 44% of the variance in the data and is necessary to distinguish the two diagnosis groups. In the plots that include PC1, most of the separation between the groups occurs around PC1 = 0 (i.e., x = 0).
## Hierarchical Clustering
library("cluster")
df.dissimilarity<-daisy(df_final[, c(3:32)], metric ="euclidean")
df.cluster.complete<-hclust(df.dissimilarity, method = "complete")
plot(df.cluster.complete)
df.cluster.4<-cutree(df.cluster.complete, k = 4)#k overrides h, so only k is needed
table(df.cluster.4)
## df.cluster.4
## 1 2 3 4
## 16 187 364 2
df.cluster.4.diagnosis<-data.frame(Diagnosis = df_final$diagnosis, Cluster = df.cluster.4)
table(df.cluster.4.diagnosis)
## Cluster
## Diagnosis 1 2 3 4
## B 0 29 326 2
## M 16 158 38 0
Answer 18. To label each cluster, assign the diagnosis that occurs most often within it. Thus Cluster 1 is malignant, Cluster 2 is malignant, Cluster 3 is benign, and Cluster 4 is benign. Under this labeling, the 29 benign tumours in Cluster 2 are misclassified (29/357, 8.1%), as are the 38 malignant tumours in Cluster 3 (38/212, 17.9%), for a total of 67/569 (11.8%) misclassified. 19. Now try 6 clusters and plot dendrograms for hierarchical clustering using Ward's linkage. Use the table() function to view the clustering result. How would you label each of these 6 clusters? Does this clustering work better or worse than the clustering result in the previous question? Give brief explanations.
df.cluster.ward<-hclust(df.dissimilarity, method = "ward.D")
df.cluster.ward.6 <-cutree(df.cluster.ward,k = 6)
table(df.cluster.ward.6)
## df.cluster.ward.6
## 1 2 3 4 5 6
## 104 58 94 155 140 18
plot(df.cluster.ward)
df.cluster.6.diagnosis<-data.frame(Diagnosis = df_final$diagnosis, Cluster = df.cluster.ward.6)
table(df.cluster.6.diagnosis)
## Cluster
## Diagnosis 1 2 3 4 5 6
## B 0 5 53 146 136 17
## M 104 53 41 9 4 1
Answer 19. Going by the majority diagnosis in each cluster: Cluster 1 is malignant, Cluster 2 is malignant, and Cluster 3 can be labeled benign, although benign exceeds malignant there by only a small margin (53 vs. 41). Clusters 4, 5, and 6 are benign. Under this labeling, 5 of the 357 benign tumours (1.4%) and 55 of the 212 malignant tumours (25.9%) are misclassified, i.e., 60 of 569 (10.5%) in total. Compared with the 4-cluster result, the overall misclassification rate is lower with 6 clusters, but the problem is that Cluster 3 cannot be confidently assigned to either label, since it contains roughly equal numbers of both diagnoses.
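The majority-vote misclassification computations above can be formalized with a small helper that takes a Diagnosis x Cluster contingency table. This is a sketch; `misclass_rate` is a name I am introducing:

```r
# Sketch: label each cluster by its majority diagnosis (the column maximum)
# and return the fraction of observations that fall outside that majority.
misclass_rate <- function(tab) {
  1 - sum(apply(tab, 2, max)) / sum(tab)
}
```

Applied to the 4-cluster table above, `misclass_rate(table(df.cluster.4.diagnosis))` reproduces 67/569, and on the 6-cluster table it reproduces 60/569.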
kc<-kmeans(df_final[, 3:32], 2)#Note: k-means depends on random initialization; calling set.seed() beforehand makes this reproducible
df_kmeans_plot<-data.frame(Diagnosis = df_final$diagnosis, Cluster = kc$cluster)
table(df_kmeans_plot)
## Cluster
## Diagnosis 1 2
## B 346 11
## M 33 179
library("factoextra")
fviz_cluster(kc, data = df_final[, 3:32])
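Here k = 2 is given by the two diagnosis classes, but in general the choice of k can be checked with the elbow heuristic. A sketch, using base R's USArrests data as a stand-in for the standardized features:

```r
# Sketch: total within-cluster sum of squares for k = 1..8; look for the
# "elbow" where adding clusters stops paying off. nstart > 1 reduces
# sensitivity to random initialization; set.seed() makes the run reproducible.
set.seed(42)
x <- scale(USArrests)
wss <- sapply(1:8, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```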
df_pca_comp = data.frame(PCA_Cluster = as.numeric(pca12$data$Comp.1 < 0), Diagnosis = df_final$diagnosis)
table(df_pca_comp$Diagnosis, df_pca_comp$PCA_Cluster)
##
## 0 1
## B 29 328
## M 192 20
Misclassified B: 29/357 (8.1%); M: 20/212 (9.4%); total: 49/569 (8.6%)
## For Hierarchical Clustering the results are:
df.cluster.4.diagnosis<-data.frame(Diagnosis = df_final$diagnosis, Cluster = df.cluster.4)
table(df.cluster.4.diagnosis)
## Cluster
## Diagnosis 1 2 3 4
## B 0 29 326 2
## M 16 158 38 0
Misclassified (4 clusters): B: 29/357 (8.1%); M: 38/212 (17.9%); total 67/569 (11.8%). Misclassified (6 clusters): B: 5/357 (1.4%); M: 55/212 (25.9%); total 60/569 (10.5%).
## For K-means the results are
table(df_kmeans_plot)
## Cluster
## Diagnosis 1 2
## B 346 11
## M 33 179
Misclassified: B: 11/357 (3.1%); M: 33/212 (15.6%); total 44/569 (7.7%)
Based on the misclassification rates, K-means clustering performs best (7.7%), followed by the PCA-based classification (8.6%), followed by hierarchical clustering (10.5% with 6 clusters, 11.8% with 4).